EVALUATION


This report describes theoretical and experimental
studies done to improve the performance of INTEL, a
process for enhancing the signal-to-noise ratio of
speech which has been corrupted by wideband noise. The
theoretical studies showed some of the statistical
properties of the Gaussian noise in the INTEL process.
While the intelligibility of speech is not mathematically
definable, several experiments indicate that the rooting
method used in part of the INTEL process tends to
separate the speech and noise components.

While the phase is left alone in this process,
experiments indicate that the correct phase will improve
the intelligibility considerably. Future work should
investigate possible methods of reducing the phase
noise.

The modified cepstrum gating process, as a result
of the theoretical and experimental work, should also
be modified in future research.

The theoretical and experimental work performed
indicated several promising areas for future research.


kTJinmT  xr'omTi  s'  ~ * 

Capt,  USAF 
Project Engineer


1.0  INTRODUCTION


This  report  describes  theoretical  and  experimental 
studies  done  to  improve  the  performance  of  INTEL,  a process  for 
improving  the  signal-to-noise  ratio  of  speech  which  has  been 
corrupted  by  wideband  noise. 

The  theoretical  study  is  a detailed  statistical  study  of 
the  process,  showing  how  it  works,  why  its  effect  on  speech  is 
different  from  its  effect  on  noise,  and  why  the  cepstrum,  a 
closely  similar  process,  does  not  provide  similar  benefits.  In 
addition,  possible  areas  for  further  research  are  identified 
from  the  theoretical  findings. 

The experimental studies were explorations suggested by
previous studies and do not reflect the findings of the theoretical
investigation mentioned above. These studies implemented
various modifications to the basic process in the hope that they
would improve its performance. The modifications were:

Threshold  clipping 

Center  clipping 

Harmonic  emphasis 

Adaptation  to  narrow-band  speech 

In addition, a number of experiments were conducted to study the
effect of phase on speech intelligibility.


2.0  ANALYSIS OF THE INTEL PROCESS


The noise reduction process known by the name of INTEL
was largely developed empirically, as an attempt to capitalize
on the difference between the autocorrelation functions of noise
and voiced speech. (The motivation and early phases of the
research are described in Ref. 1.) In order to continue improvement
of the process, however, a theoretical understanding of the
process is essential. Accordingly, we have done such an analysis,
the results of which are given in this section. After
showing how the process can be approximated by a simplified
model, we analyze the behavior of the system with noise input
and with speech input. We use the results of these analyses to
provide a qualitative description of what happens when noisy
speech is processed. (Because of the complexity of the system,
we have not been able to provide a quantitative description in
the case of noisy speech.)

2.1  Description  of  Process 

The  basic  process  is  shown  in  block-diagram  form  in 
Figure  1.  The  process  consists  of  the  following  steps: 

1.  The incoming signal is divided into overlapping segments
51.2 msec long. (Each segment is processed separately and
the output segments are overlapped and added, producing the
processed signal, as described in Ref. 1. The steps given here
detail the processing of one segment.)

2.  Triangular weighting is applied to the segment.
This reduces spectrum sidelobes during the process and, in the
output speech, smooths the transition from one processing regime
to the next.

3.  The array is Fourier transformed.

4.  The absolute value of the transform is taken. The
phase is saved for future reference.

5.  The upper half of the array is set to zero and the
nth root of the remaining part is taken. (n is usually 2 or 3.)
We term this process root compression; n is called the root
compression factor.

6.  Every odd-numbered element is reversed in sign.
(This is simply a computational convenience which has the effect
of shifting the origin of the next transform to a more convenient
location.)

7.  The array is Fourier transformed a second time. The
result of this second transformation is called the pseudo-cepstrum.
(A regular cepstrum is generated by logarithmic compression
instead of root compression.)

8.  Samples adjacent to the origin are set to zero.
This operation, which is the essential step in the process, is
called gating.

From this point on, the transformations of Steps 3
through 7 are undone: that is, the function is inverse-transformed,
the sign reversals are removed, the array is raised
to the nth power, the phase is restored, and a second inverse
transform is done.
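
The steps above can be sketched in code for a single segment. The following is a minimal illustration under stated assumptions, not the original implementation: the function name `intel_segment`, the parameter `gate_halfwidth`, the use of `np.bartlett` for the triangular weighting, the clamping of negative values before the nth power, and the recovery of a real signal from the one-sided spectrum are all choices made here for the example.

```python
import numpy as np

def intel_segment(x, n=2, gate_halfwidth=4):
    """Run one segment through Steps 2-8 and then undo Steps 7-3."""
    N = len(x)
    half = N // 2
    k = np.arange(N)

    xw = x * np.bartlett(N)                   # Step 2: triangular weighting
    y1 = np.fft.fft(xw)                       # Step 3: first transform
    mag, phase = np.abs(y1), np.angle(y1)     # Step 4: magnitude; phase is saved

    y3 = np.zeros(N)
    y3[:half] = mag[:half] ** (1.0 / n)       # Step 5: zero upper half, nth root

    signs = np.where(k % 2 == 0, 1.0, -1.0)   # Step 6: flip odd-numbered elements
    z = np.fft.fft(y3 * signs)                # Step 7: pseudo-cepstrum, origin at N/2

    # Step 8: gate the samples adjacent to the origin (the origin itself is kept).
    for d in range(1, gate_halfwidth + 1):
        z[half - d] = 0.0
        z[half + d] = 0.0

    # Return trip: undo Steps 7, 6, 5, 4 and 3 in that order.
    y3p = np.real(np.fft.ifft(z)) * signs
    y3p = np.maximum(y3p, 0.0)                # assumption: clamp negative values
    y2p = y3p ** n
    y2p[half:] = 0.0                          # assumption: upper half stays zero
    y1p = y2p * np.exp(1j * phase)            # restore the saved phase
    y1p[0] *= 0.5                             # DC bin must not be doubled below
    return 2.0 * np.real(np.fft.ifft(y1p))    # assumption: real part of one-sided inverse
```

With `gate_halfwidth=0` the chain reduces to the triangular-weighted input (apart from the discarded Nyquist bin), which is a convenient check that the forward and return paths are consistent. Overlap-adding successive segments, as in Step 1, is left out of the sketch.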

The purpose of the gating operation, Step 8, is to remove
a large buildup around the origin which is due to the noise.
When this is done, we find that the noise level in the reconstructed
signal is significantly reduced. Clearly the sequence
of steps from 4 through 7 can be regarded (except for the
zeroing-out of the upper half of $y_2$) as a single reversible
transformation. Accordingly, the first question to be asked is:
why does this transformation (apparently) move most of the noise
down to within a few samples of the origin and not do the same
thing to speech? To answer this question, we first make a
simplified model of the process and then analyze the behavior
of the model when noise or speech signals are applied.

2.2  Analysis  with  Noise  Input 

The simplified model used in this analysis is formed by
removing the time-weighting function (Step 2), the zeroing-out
of the upper half of $y_2$ (Step 5), and the sign-reversal of
alternate elements (Step 6). The removal of Step 6 has no significant
effect on the process; the removal of time weighting
and the zeroing-out operations make a small difference that can
be easily corrected once we understand the process as a whole.

A block diagram of the simplified model is given in Figure 2.
From this figure, it will be seen that we use $x$ to designate the
input signal, $y_1$ to designate the first transform, $y_2$ to designate
the absolute value of $y_1$, $y_3$ to designate the root of
$y_2$, and $z$ to designate the second transform. When these signals
are stochastic processes, this fact will be indicated by the use
of a wavy underscore: e.g., $\underset{\sim}{y}_1$. For convenience, the independent
variable will be omitted when no ambiguity will result from
doing so. We will now describe the statistics of each of these
signals when white noise is applied to the system. In this analysis,
we will assume a familiarity with stochastic processes and
with the properties of the discrete Fourier transform (DFT).

a.  Input Signal. Let $x(t)$ be a sequence of real, zero-mean
stationary Gaussian noise samples of amplitude $\sigma_x$. Then $x$
has the probability density function,

$$f_x(x) = \frac{1}{\sigma_x\sqrt{2\pi}} \exp\left(-\frac{x^2}{2\sigma_x^2}\right) \qquad (1)$$

Since $x(t)$ is assumed white, its autocorrelation function is
given by

$$R_x(\tau) = \sigma_x^2\,\delta(\tau) \qquad (2)$$

b.  First Transform ($y_1$). Following Ref. 2, p. 368f,
we derive the statistics of $y_1(f)$, the DFT of $x$. By definition,

$$y_1(f) = \frac{1}{N}\sum_{t=0}^{N-1} x(t)\,W^{-ft} \qquad (3)$$

where $W = \exp(2\pi j/N)$. First, we note that for $x(t)$ real and
Gaussian, the samples of $y_1$ will be complex and Gaussian. Next,

$$E\{y_1(f)\} = \frac{1}{N}\sum_{t=0}^{N-1} E\{x(t)\}\,W^{-ft} = \begin{cases} \eta_x, & f = 0 \\ 0, & \text{otherwise} \end{cases} \qquad (4)$$

Finally, the variance of the spectrum component at frequency $f$
is equal to the corresponding term in the DFT of $R_x(\tau)$. For if
$R_x(\tau) \leftrightarrow \sigma^2(f)$, then by (3)

$$R_x(\tau) = \sum_{u=0}^{N-1} \sigma^2(u)\,W^{u\tau}$$

Then

$$E\{y_1(f_1)\,y_1^*(f_2)\} = \frac{1}{N^2}\sum_{t=0}^{N-1}\sum_{s=0}^{N-1} R_x(t-s)\,W^{-f_1 t}\,W^{f_2 s} = \begin{cases} \sigma^2(f_1), & f_1 = f_2 \\ 0, & f_1 \neq f_2 \end{cases} \qquad (5)$$

Since the variance of $y_1(f)$ is $E\{y_1^*(f)\,y_1(f)\}$, this completes
the proof. (These results are applicable to the DFT of any
stochastic process and will be used again in Section e, below.)
We note that the samples of $y_1$ are independent, since they are
Gaussian and, by (5), orthogonal.

In our case, $\eta_x = 0$ so $y_1$ is zero mean.
$R_x(\tau) = \sigma_x^2\,\delta(\tau)$; hence $\sigma^2(f) = \sigma_y^2$ (a constant). Thus $y_1(f)$ is
stationary with variance $\sigma_y^2$. Because the samples of $y_1$ are
independent, the autocorrelation function of $y_1$ is

$$R_{y_1}(\phi) = \sigma_y^2\,\delta(\phi) \qquad (6)$$

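c.  Absolute Value. We next consider $y_2$, the complex
absolute value of $y_1$. Since the complex-absolute-value operation
is zero memory, it follows that if the samples of $y_1$ are
independent, then so are the samples of $y_2$. It is well known
that the absolute value of a zero-mean complex Gaussian signal
has a Rayleigh density:

$$f_{y_2}(y) = \frac{y}{\sigma_y^2}\exp\left(-\frac{y^2}{2\sigma_y^2}\right)U(y) \qquad (7)$$

where $U(y)$ is the unit step,

$$U(y) = \begin{cases} 1, & y \ge 0 \\ 0, & \text{otherwise} \end{cases}$$

The mean and variance of $y_2$ are

$$\eta_{y_2} = \sigma_y\sqrt{\pi/2}, \qquad \sigma_{y_2}^2 = \left(2 - \frac{\pi}{2}\right)\sigma_y^2$$

The autocorrelation function of $y_2$ is found as follows:
By definition,

$$R_{y_2}(\phi) = E\{y_2(f)\,y_2(f+\phi)\} \qquad (10)$$

For $\phi$ nonzero, the samples are independent; hence

$$R_{y_2}(\phi) = E\{y_2(f)\}\,E\{y_2(f+\phi)\} = \eta_{y_2}^2 \qquad (11)$$

For $\phi = 0$,

$$R_{y_2}(0) = E\{[y_2(f)]^2\} = \sigma_{y_2}^2 + \eta_{y_2}^2 \qquad (12)$$

Hence

$$R_{y_2}(\phi) = \eta_{y_2}^2 + \sigma_{y_2}^2\,\delta(\phi) = \sigma_y^2\left[\frac{\pi}{2} + \left(2 - \frac{\pi}{2}\right)\delta(\phi)\right] \qquad (13)$$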

d.  Root Compression ($y_3$). In considering the nth-root
function, $y_3(f)$, the situation becomes more complicated. Since
taking the nth root is also a zero memory operation, the autocorrelation
of $y_3$ follows the reasoning of equations 10 through 13.
The values of $\eta_{y_3}$ and $\sigma_{y_3}$ are found from the density function of
$y_3$. Following Papoulis (Ref. 2, p. 126f) (and using his notation),
we have

$$y = g(x) = x^{1/n}, \qquad x_1 = y^n$$

$$g'(x) = \frac{1}{n}\,x^{(1/n)-1}, \qquad g'(x_1) = \frac{1}{n\,y^{n-1}}$$

Since

$$f_y(y) = \frac{f_x(x_1)}{|g'(x_1)|}$$

and

$$f_x(x) = \frac{x}{\sigma^2}\exp\left(-\frac{x^2}{2\sigma^2}\right)U(x),$$

the required density function is

$$f_y(y) = \frac{n\,y^{2n-1}}{\sigma^2}\exp\left(-\frac{y^{2n}}{2\sigma^2}\right)U(y)$$

Returning to our own notation,

$$f_{y_3}(y_3) = \frac{n\,y_3^{2n-1}}{\sigma_y^2}\exp\left(-\frac{y_3^{2n}}{2\sigma_y^2}\right)U(y_3) \qquad (14)$$

We also need the moments of this density function.
The mth moment is given by

$$m_m = \int_0^\infty y^m\,f_{y_3}(y)\,dy$$

This can be evaluated by a change of variable: with
$u = y^{2n}/2\sigma_y^2$,

$$m_m = (2\sigma_y^2)^{m/2n}\int_0^\infty u^{m/2n}\,e^{-u}\,du$$

Integrals of this form are tabulated in Ref. 3. The result is

$$m_m = (2\sigma_y^2)^{m/2n}\,\Gamma\left(1 + \frac{m}{2n}\right) \qquad (15)$$

From these derivations we have the following particular results:

1.  Area: $A = m_0 = (2\sigma_y^2)^0\,\Gamma(1) = 1$ (not surprising)

2.  Mean: $\eta_{y_3} = m_1 = (2\sigma_y^2)^{1/2n}\,\Gamma\left(\frac{2n+1}{2n}\right)$, with $\lim_{n\to\infty} \eta_{y_3} = 1$

3.  Second moment: $m_2 = (2\sigma_y^2)^{1/n}\,\Gamma\left(\frac{n+1}{n}\right)$

4.  Variance: $\sigma_{y_3}^2 = m_2 - m_1^2 = (2\sigma_y^2)^{1/n}\left[\Gamma\left(\frac{n+1}{n}\right) - \Gamma^2\left(\frac{2n+1}{2n}\right)\right]$, with $\lim_{n\to\infty} \sigma_{y_3}^2 = 0$

The density function for $y_3(f)$ is plotted for several
different values of n in Figure 3.

Using these results and equation 10, we have, for the
nth-root function

$$R_{y_3}(\phi) = \eta_{y_3}^2 + \sigma_{y_3}^2\,\delta(\phi) \qquad (16a)$$

$$R_{y_3}(\phi) = (2\sigma_y^2)^{1/n}\left\{\Gamma^2\left(\frac{2n+1}{2n}\right) + \left[\Gamma\left(\frac{n+1}{n}\right) - \Gamma^2\left(\frac{2n+1}{2n}\right)\right]\delta(\phi)\right\} \qquad (16b)$$

In the limit as n increases indefinitely,

$$\lim_{n\to\infty} R_{y_3}(\phi) = 1$$
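
The moment formula (15) lends itself to a quick numerical check. The sketch below is only an illustration: it draws Rayleigh samples with NumPy, takes their nth root, and compares the sample mean and variance with the closed-form values. The function names `root_moment` and `root_mean_var`, the sample size, and the seed are choices made for this example.

```python
import math
import numpy as np

def root_moment(m, n, sigma):
    """m-th moment of the nth root of a Rayleigh variable, per equation (15)."""
    return (2.0 * sigma**2) ** (m / (2.0 * n)) * math.gamma(1.0 + m / (2.0 * n))

def root_mean_var(n, sigma):
    """Mean and variance of the root-compressed variable from the moments."""
    mean = root_moment(1, n, sigma)
    return mean, root_moment(2, n, sigma) - mean**2

rng = np.random.default_rng(0)
sigma, n = 1.0, 2
y2 = rng.rayleigh(scale=sigma, size=200_000)   # samples with the density of (7)
y3 = y2 ** (1.0 / n)                           # root compression

mean_theory, var_theory = root_mean_var(n, sigma)
print("mean:     theory", mean_theory, " sample", y3.mean())
print("variance: theory", var_theory, " sample", y3.var())
```

Evaluating `root_mean_var` for large n shows the mean approaching 1 and the variance approaching 0, in line with the limits listed above.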


e.  Second Transform ($z$). We finally turn to $z(v)$, the
DFT of $y_3$. Using the results of Section b above, and particularly
Equations (4) and (5), $z(v)$ will again be a sequence of
independent random variables such that $E\{z(v)\} = \eta_{y_3}\,\delta(v)$ and
$E\{|z(v)|^2\} = \sigma_z^2(v)$, where $\sigma_z^2(v)$ is the transform of (16).
Hence

$$z(v) = a\,\delta(v) + b \qquad (17)$$

where $a$ and $b$ are random amplitudes with the statistics,

$$E\{a\} = \eta_{y_3}$$

$$E\{b^2\} = \sigma_{y_3}^2$$

f.  Removal of Simplifications. At this point, we can
remove the two principal simplifications in our model. First,
the top half of the spectrum is set to zero (Step 5, p. 3).
This is equivalent to multiplying $y_3$ by a rectangular window,
and the effect will be to convolve $z(v)$ with a sin v/v function.
This will show up as a broadening of the impulse at the origin,
but will have little other visible effect on $z(v)$. Second, $x(t)$
is subjected to triangular weighting. This will convolve $y_1(f)$
with $(\sin f/f)^2$ and the resultant broadening of the peaks in $y_2$
and $y_3$ will cause a high-end falloff in $z(v)$. (Strictly, $z(v)$ is
multiplied by the Fourier transform of $(\sin f/f)^{2/n}$; this transform
cannot be expressed in closed form.)

2.3  Comparison with Observed Results

Figure 4 shows $z(v)$ as plotted from an actual run of
INTEL with noise input. The similarity with the theoretical
result is apparent. The most conspicuous feature is the buildup
at the origin. In the light of our analysis, we know that this is
the sin v/v shape resulting from the convolution just discussed.
The first few sidelobes of sin v/v are in fact clearly visible
at "A" in the figure. The rest of the plot shows low-level
noise, which corresponds to the term $b$ in equation (17).

The noise-removal operation corresponds to removing the
buildup at the origin. Naturally, the entire buildup cannot be
removed, especially not the component at $v = 0$, because this
corresponds to the constant portion of $y_3(f)$. If the constant
were set to zero, $y_3'$ would frequently go negative. Since $y_3$
was derived from an absolute value, this would be absurd. Hence
only the side lobes of the buildup are removed. The principal
result of this operation is that $y_3'$ (the prime indicates the
value of the function on the "return trip" through the process)
will have a mean which is less than 1. When $y_2'$ is computed,
its mean will be still lower, since its mean is approximately
the nth power of the mean of $y_3'$. When the phase is restored to
$y_2'$ to generate $y_1'$, the result will be complex noise samples of
much smaller amplitude than the samples of $y_1$. Hence a reduction
in noise results.
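
The buildup at the origin can be reproduced with a few lines of code using the simplified model of Section 2.2 (no time weighting, no zeroing of the upper half, no sign reversal). This is a sketch under assumed values: the segment length of 512, the root factor n = 2, and the seed are illustrative and are not taken from the runs shown in the figure.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 512, 2                      # assumed segment length and root factor

x = rng.standard_normal(N)         # white Gaussian noise input
y3 = np.abs(np.fft.fft(x)) ** (1.0 / n)   # first transform, magnitude, nth root
z = np.fft.fft(y3)                 # second transform

origin = np.abs(z[0])              # the impulse a*delta(v) of equation (17)
floor = np.median(np.abs(z[1:]))   # typical level of the noise term b

print("origin level:", origin)
print("median level elsewhere:", floor)
```

The origin sample is far larger than the rest of the array because every element of the root-compressed spectrum is positive, so its constant part adds coherently while the fluctuations do not.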


2.4  Analysis  with  Speech  Input 

It remains to ask why speech is not simultaneously
affected by this process. The answer, briefly, is that speech
(at least, vocalic speech) is a periodic function rich in harmonics.
The result is that $z$ contains many components away from
the origin which are unaffected by the gating applied about the
origin.

The spectrum of vocalic speech consists of a train of
harmonics spaced at the pitch frequency $f_p$. If we transform a
short segment of such speech, weighted by a time window function
$w(t)$, then the result will be

$$y_1(f) = \sum_k a_k\,W(f - k f_p) \qquad (18)$$

where $f_p$ is the pitch frequency,
$W(f)$ is the transform of the time window used,
$a_k$ is the complex amplitude of the kth harmonic.

$W(f)$ can be assumed real without loss of generality, and under
the processing conditions used, the overlap between harmonics is
negligible. If there is no overlap, then at any frequency $f$ the
only contribution is that of the nearest harmonic. Because of
this fact, we can take absolute values and nth roots inside the
summation. Hence

$$y_2(f) = |y_1(f)| = \sum_k |a_k|\,|W(f - k f_p)|$$

and

$$y_3(f) = [y_2(f)]^{1/n} = \sum_k |a_k|^{1/n}\,W'(f - k f_p) \qquad (19)$$

where $W'(f)$ is the nth root of $|W(f)|$. Hence after the nth-root
process, $y_3(f)$ is still periodic. This means that after the
second transformation, $z(v)$ will have at least one component not
at the origin, corresponding to the period of $y_3(f)$. Indeed,
$z(v)$ has many components because $y_3$ is not a sinusoid. Thus

$$z(v) = \sum_m w_m\,A(v - m v_p) \qquad (20)$$

where the spacing $v_p = 1/f_p$ and the amplitudes $w_m$ depend on the
shape of $W'(f)$. Specifically, if the Fourier transform of $W'(f)$
is $w'(v)$, then $w_m = w'(m v_p)$. The $w_m$ values will depend on the
specific function $W(f)$ and on how taking the nth root of $W$
affects the shape of $w'(v)$. $A(v)$ is the transform of the envelope
of $y_3(f)$ and thus contains formant information and some
talker-identity cues.
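
The argument above can be illustrated with a synthetic periodic signal. The sketch below builds a segment from equal-amplitude harmonics, forms the root-compressed half spectrum, and locates the strongest pseudo-cepstrum component away from the origin. All numerical values (segment length, harmonic spacing of 8 DFT bins, n = 2, the search start index) are assumptions for the example, and a sum of cosines is only a crude stand-in for vocalic speech.

```python
import numpy as np

N, n = 512, 2          # assumed segment length and root compression factor
spacing = 8            # assumed harmonic spacing (pitch) in DFT bins
rng = np.random.default_rng(2)
t = np.arange(N)

# Synthetic "voiced" segment: equal-amplitude harmonics with random phases.
x = np.zeros(N)
for k in range(1, N // (2 * spacing)):
    x += np.cos(2 * np.pi * k * spacing * t / N + rng.uniform(0, 2 * np.pi))

y1 = np.fft.fft(x * np.bartlett(N))            # weighted first transform
y3 = np.zeros(N)
y3[: N // 2] = np.abs(y1[: N // 2]) ** (1.0 / n)   # magnitude, upper half zeroed, nth root
z = np.abs(np.fft.fft(y3))                     # magnitude of the second transform

lo = 8                                         # skip the region around the origin
peak = lo + int(np.argmax(z[lo : N // 2]))
print("strongest component away from the origin:", peak)
print("expected from harmonic spacing (N / spacing):", N // spacing)
```

Because the spectrum repeats every `spacing` bins, the second transform concentrates at multiples of `N / spacing`, which is the discrete counterpart of $v_p = 1/f_p$; a gate confined to the first few samples around the origin leaves these components untouched.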

Although the nth-root operation is nonlinear, and hence
superposition does not apply, in practice the presence of noise
does not prevent this periodic structure from showing up. When
the part of $z(v)$ around the origin is gated out, this also
affects the amplitude of the speech signal. In general, however,
the peak shape $A(v)$, which contains the formant information, is
significantly wider than sin v/v, so a fair amount of it escapes
the gating process. (In fact, one of the considerations determining
the width of the gate is that as much of $A(v)$ shall be
preserved as possible.) Between the surviving portion of $A(v)$
and the components found about multiples of $v_p$, enough information
is preserved to provide recognizable speech in the output
signal.

A side effect of the gating process is that it produces
a slight enhancement of the high end of $y_3'(f)$ (above 2.2 kHz)
and a region of attenuation from about 1.2 kHz to 2 kHz. When
$y_3'$ is raised to the nth power to form $y_2'$, these effects are
exaggerated and the quality of the recovered speech is degraded.
To counteract this effect, we multiply $y_2'$ by an equalizing
function. (It might be possible also to correct this distortion
by shaping the edges of the gating function but as post-equalization
works well enough, we have not tried doing so.)

2.5  Logarithmic Compression

A question which arises in the course of this analysis
is why root compression works while
logarithmic compression does not. Using the background provided
by the foregoing analysis, it is easy to show that a logarithmic
transformation does not produce the separation of the noise components
that the nth root does.

If a logarithmic conversion is substituted for the root
operation, the statistics of $y_3(f)$ are different. If the log
conversion is given by

$$y_3 = k \ln y_2$$

then the probability density function of $y_3$ is

$$f_{y_3}(y_3) = \frac{1}{k\sigma_y^2}\exp\left[\frac{2y_3}{k} - \frac{1}{2\sigma_y^2}\exp\left(\frac{2y_3}{k}\right)\right] \qquad (21)$$

(This is determined the same way as before.) This function is
plotted for various values of k with $\sigma_y = 1$ in Figure 5. The mth
moment of this density is

$$m_m = \int_{-\infty}^{\infty} \frac{y^m}{k\sigma_y^2}\exp\left[\frac{2y}{k} - \frac{1}{2\sigma_y^2}\exp\left(\frac{2y}{k}\right)\right]dy$$

By a suitable change of variable, this integral can be evaluated
for m = 1; it does not converge for higher m. The mean is given
by

$$\eta_{y_3} = \frac{k}{2}\left[\ln(2\sigma_y^2) - C\right] \qquad (22)$$

where C is Euler's constant, 0.5772156649. The mean is a linear
function of k and therefore vanishes as k approaches zero. The
standard deviation, however, is not finite.

Notice that here we have a situation that is the opposite
of the INTEL case. For INTEL,

$$z(v) = a\,\delta(v) + b$$

where for increasing n, $E\{a\} \to 1$ and $E\{b^2\} \to 0$. Here, however,
$E\{a\} \to 0$ and $E\{b^2\} \to \infty$; hence we get noise everywhere and no
buildup at the origin. Since we rely on this buildup to make
the noise components separate, log conversion does not lead to
the desired result.

In actual computation, of course, it is not practical to
do a pure log conversion since zero spectrum samples can occur.
Instead, below some selected value a straight line approximation
is used. This probably prevents the second moment of $f_{y_3}(y_3)$
from blowing up as described above, but it does not alter the
fact that $\sigma_{y_3}$ is large compared to $\eta_{y_3}$ and that therefore no
buildup occurs.


2.6  Optimization 

Although  these  studies  provide  an  understanding  of  the 
processes  involved,  they  do  not  yield  tidy,  quantitative  answers 
to  such  questions  as  how  best  to  choose  the  root  compression 
factor  and  how  much  improvement  this  optimum  will  yield. 

A little  reflection  will  show  that  the  answer,  if  available, 
would  be  of  little  use.  We  know  empirically  that  factors  less 
than  2 or  greater  than  4 are  unsatisfactory.  The  evidence  also 
indicates that the variation in quality, as n is varied over the
region from 2 to 4, is not great. It seems highly unlikely, in
view  of  the  distribution  functions  uncovered  in  the  analysis, 
that  there  is  some  sharp  (i.e.,  narrow)  optimum  hidden  somewhere 
in  this  range.  The  analysis  suggests,  in  fact,  that  other  lines 
of  attack  may  prove  more  fruitful.  These  will  be  discussed  in 
Section 4.


3.0  FURTHER DEVELOPMENT OF THE INTEL PROCESS

Under the theoretical studies, described in Section 2,
we examined the way root-compression followed by Fourier transformation
of the spectrum improves the separability of speech
and additive noise. During experimental studies, described in
this section, we tried to find additional ways to attenuate the
components of noise, either in the spectrum or in the second-order
spectrum, without at the same time equally attenuating
the components of speech. Unfortunately, we did not succeed.
Nevertheless, the techniques we examined are worth describing
since they illustrate the kinds of processing that have been
tested in our search for methods to improve the INTEL process.

3.1  Threshold Clipping

As discussed earlier, the INTEL process enhances the
S/N of a signal in the second-order spectrum by attenuating a
region in which the S/N is lower than the overall value.
Alternatively, the S/N can be raised by emphasizing regions of
the second-order spectrum in which the S/N is higher than the
overall value. Such regions occur at integral multiples of
the period of the vocal pitch, at which locations speech power
concentrates in the second-order spectrum. Based on this
approach, we developed and tested a method of processing the
transformed signal, which we named Pitch Zone Emphasis, and
which is described in a previous report.* As implied by the
name, the method was to emphasize components of the second-order
spectrum in the neighborhood of multiples of the pitch
period. It proved to be very effective when the pitch was
known with a maximum error of ±10 percent. However, when the
error exceeded this limit, the process emphasized components
of noise rather than those of speech, thereby making the output
worse than the input. Obviously, this approach to processing
the transformed signal, while conceptually correct, cannot be
used until improved methods of measuring pitch at S/N below
0 dB are available.

Under  the  current  study,  we  explored  another  method  of 
emphasizing  the  speech  components  in  the  second-order  spectrum. 
This  new  method  does  not  attempt  to  locate  the  regions  in  which 
speech  components  are  concentrated.  Instead  it  takes  advantage 
of  the  observation  that  the  amplitudes  of  these  components  tend 
to  be  larger  than  the  amplitudes  of  noise  in  the  same  regions. 

* Final report on Contract F30602-73-C-0100

Using this approach, we determine the average level of
the noise at all points in the second-order spectrum. This
average-level function is used as a threshold for discriminating
between speech and non-speech components. We can use this
threshold in either of two ways to emphasize the speech
components:

1.  Samples smaller than the threshold are set to zero.
Samples larger than it are unaffected.

2.  The threshold is subtracted from the absolute
amplitude of each sample in the pseudo-cepstrum.
(Samples smaller than the threshold are set to zero.)

For convenience, we refer to the function as a clipping threshold,
the first method of using it as absolute clipping, and
the second method as center clipping.
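
The two ways of applying the clipping threshold can be written compactly. The following is a minimal sketch, assuming the pseudo-cepstrum is held in a NumPy array (real or complex) and the threshold is either a scalar or an array of the same length; the function names are hypothetical, and treating a sample exactly equal to the threshold as "larger" is a choice made here.

```python
import numpy as np

def absolute_clip(z, threshold):
    """Zero samples whose magnitude is below the threshold; keep the rest unchanged."""
    z = np.asarray(z)
    return np.where(np.abs(z) >= threshold, z, 0)

def center_clip(z, threshold):
    """Subtract the threshold from each magnitude (floored at zero), keeping sign or phase."""
    z = np.asarray(z)
    mag = np.abs(z)
    reduced = np.maximum(mag - threshold, 0.0)
    safe = np.where(mag > 0, mag, 1.0)     # avoid division by zero for null samples
    return z * (reduced / safe)

z = np.array([0.5, -2.0, 3.0, -0.1])
print(absolute_clip(z, 1.0))
print(center_clip(z, 1.0))
```

In the tests described below the threshold was a multiple of the average noise amplitude function; here that would correspond to passing, for example, `1.5 * noise_level` as `threshold`, where `noise_level` is a hypothetical array of average noise amplitudes.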

A series of tests was run for each of these approaches.
The amplitude of the clipping threshold, which was held constant
during each test, was varied from 0.2 to 2.0 times the
average noise amplitude function.

The signal regenerated after absolute clipping of the
second-order spectrum was almost indistinguishable from the
signal regenerated without the use of absolute clipping. The
only significant difference occurred for thresholds that were
greater than 1.5 times the average noise amplitude function.
In these particular tests the acoustic quality of the noise was
transformed from a steady hiss to a partial gurgling sound.

Center  clipping  did  enhance  the  signal-to-noise  ratio 
in  the  output  signal.  However,  it  also  tended  to  flatten  the 
envelope  of  the  signal  spectrum,  thereby  suppressing  the  ratio 
of  peak-to-valley  amplitudes  of  the  formants.  The  effect  was 
to  make  the  regenerated  speech  distinctly  less  intelligible. 
This  distortion  is  caused  by  the  tendency  of  center  clipping  to 
emphasize  the  peaks  that  are  centered  at  integral  multiples  of 
the  pitch  period.  Pitch  zone  shaping  avoided  this  form  of 
distortion  by  giving  equal  emphasis  to  all  components  in  a 
narrow  range  around  this  central  peak. 

The approach described above points toward a third
possible method of using a clipping threshold. In this proposed
method, the clipping threshold would be used to detect
signal components that exceeded it. Presumably, if the clipping
threshold were set properly, these components would, most of
the time, be those of speech. Whenever such a component was found,
it, and all components within a narrow range around it, would
be left unaltered. All other components would be attenuated.
In this way the threshold would be used to detect potential
pitch zones. However, no attempt would be made to identify the
correct pitch period.

3.2  Pitch Harmonic Emphasis


One objective of our study was to examine an approach
that attenuates the noise between the pitch harmonics in the
spectrum of noisy speech. Obviously, this approach requires
knowing where the pitch harmonics are, which is the same as
knowing the pitch frequency. In effect, this method is equivalent
to passing the speech signal through a comb filter in
which the comb spacing is equal to the pitch frequency.

At  the  outset,  it  is  apparent  that  this  method  cannot 
succeed  if  there  is  significant  error  in  the  pitch  frequency 
measurement.  For  example,  if  the  measurement  is  in  error  by 
five  percent,  the  estimated  location  of  the  tenth  harmonic  will 
be  in  error  by  50  percent  of  the  measured  pitch  frequency.  In 
other  words,  the  harmonic  would  be  located  halfway  between  its 
true  position  and  that  of  one  or  another  of  the  adjacent 
harmonics.  Unfortunately,  the  error  frequently  exceeds  five 
percent  for  signals  in  which  the  intelligibility  or  quality  is 
low  enough  to  require  some  form  of  processing. 
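
The arithmetic behind this example is simple: an error in the measured pitch frequency is multiplied by the harmonic number. A short sketch, with a hypothetical function name and with the error expressed as a fraction of the measured pitch frequency:

```python
def harmonic_offset(k, relative_error):
    """Offset of the k-th estimated harmonic from its true position,
    as a fraction of the measured pitch frequency."""
    return k * relative_error

# A five percent pitch error, evaluated at a few harmonic numbers.
for k in (1, 5, 10):
    print(k, harmonic_offset(k, 0.05))
```

At the tenth harmonic the offset reaches one half of the harmonic spacing, so a comb tooth placed there sits midway between two true harmonics.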

Even if the pitch frequency were known with perfect
accuracy, the method of comb filtering would not produce a
useful enhancement of the speech quality. Such a filter can
only be used to pass components at multiples of the pitch
frequency. It will pass such components whether they are of
speech, noise, or speech-plus-noise. Thus, at frequencies
where the pitch harmonics are negligible, the comb creates
artificial harmonics composed of the noise components that pass
through the comb. In tests of this method we found that the
artificial harmonics generated a buzzing sound that tracks the
pitch of the speech, and that is far more objectionable than
is the steady hiss of the input noise.

There  is  a second,  more  basic  reason  why  comb  filtering 
is  ineffective  in  enhancing  speech  quality.  In  a pair  of 
experiments  we  demonstrated  that  the  noise  that  occurs  at  the 
frequencies  of  the  harmonics  degrades  speech  quality  far  more 
than  does  the  noise  that  occurs  between  the  harmonics.  For 
these  experiments  we  generated  a comb  of  noise,  that  is,  a 
noise  signal  in  which  the  noise  was  confined  to  uniformly 
spaced  bands  that  were  spaced  one  pitch  interval  from  center  to 
center.  For  the  first  test  we  added  this  noise  to  speech, 
after  arranging  the  noise  bands  so  that  they  coincided  with 
pitch  harmonics.  For  the  second  test,  we  offset  the  bands  so 
that  they  fell  between  the  harmonics.  In  both  tests  the 
average  level  of  the  noise  was  made  equal  to  that  of  the  speech. 
The  results  clearly  showed  that  the  intelligibility  was  far 
greater for the second test than for the first one. In fact,
it was possible to still understand most of the speech in the
second  test  after  the  noise  had  been  raised  to  a level  that 
reduced  the  intelligibility  to  zero  in  the  first  test.  Since 
comb  filtering  cannot  attenuate  the  noise  that  coincides  with 
the  harmonics,  it  is  of  little  potential  value  in  enhancing  the 
intelligibility of noise-obscured speech.
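
The layout of the two noise combs can be sketched as follows. This is an illustration only: the function names, the 100 Hz pitch, and the 500 Hz upper limit are hypothetical, and the use of RMS as the "average level" is an assumption, since the report does not say how the level was measured.

```python
import math


def noise_band_centers(pitch_hz, max_hz, offset_fraction=0.0):
    """Center frequencies of noise bands spaced one pitch interval apart.

    offset_fraction = 0.0 places the bands on the harmonics (first test);
    offset_fraction = 0.5 places them midway between harmonics (second test).
    """
    centers = []
    n = 1
    while (n + offset_fraction) * pitch_hz <= max_hz:
        centers.append((n + offset_fraction) * pitch_hz)
        n += 1
    return centers


def rms(samples):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(v * v for v in samples) / len(samples))


def match_level(noise, speech):
    """Scale the noise so that its RMS level equals that of the speech."""
    gain = rms(speech) / rms(noise)
    return [gain * v for v in noise]


on_harmonics = noise_band_centers(100.0, 500.0)
between_harmonics = noise_band_centers(100.0, 500.0, offset_fraction=0.5)
```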

3.3  Processing  of  Narrow-Band  Speech 

As  originally  developed,  INTEL  was  designed  to  process 
speech  in  a band  from  DC  to  -jOO  Hz.  This  range  corresponds  to 
the  width  of  most  telephone  channels.  However,  it  sometimes 
happens  that  the  bandwidth  of  the  received  speech  is  less  than 
half this range. When such a signal is processed by INTEL, the
relative  amplitudes  of  pitch  harmonics  in  the  regenerated 
spectrum  can  be  severely  altered.  Our  objective  was  to  develop 
a method  of  processing  narrow-band  speech  without  requiring 
alterations  in  the  software  that  implements  the  INTEL  procedure. 

Through experimentation, we found that the original
amplitude relationships among the harmonics were maintained if,
before rooting the spectrum, we replaced the noise components
in the spectrum region above the cut-off frequency of the speech
by a DC level. This method worked best when the DC level was
set equal to the average level of the spectrum below the cut-off
frequency. After regenerating the spectrum, the recovered
level above the cut-off frequency was set to zero. Thus, not only does this method
minimize  distortion  in  the  amplitudes  of  the  pitch  harmonics 
for  narrow-band  speech,  but,  by  virtue  of  setting  the  band 
above  the  cut-off  frequency  to  zero,  it  increases  the  S/N  in 
the  regenerated  speech  signal. 
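
The two steps can be sketched as follows, assuming the spectrum is held as a list of magnitudes and the cut-off frequency has already been converted to a bin index. The function names and the example values are hypothetical, and the INTEL processing itself is not shown.

```python
def fill_above_cutoff(magnitudes, cutoff_bin):
    """Replace bins at or above cutoff_bin with the mean level below it."""
    below = magnitudes[:cutoff_bin]
    level = sum(below) / len(below)
    return below + [level] * (len(magnitudes) - cutoff_bin)


def zero_above_cutoff(magnitudes, cutoff_bin):
    """Set bins at or above cutoff_bin to zero after regeneration."""
    return magnitudes[:cutoff_bin] + [0.0] * (len(magnitudes) - cutoff_bin)


# Assumed example: speech occupies bins 0..2, noise alone occupies bins 3..5.
spectrum = [4.0, 2.0, 6.0, 0.3, 0.1, 0.2]
prepared = fill_above_cutoff(spectrum, cutoff_bin=3)
# ... rooting, gating, and regeneration by INTEL would take place here ...
cleaned = zero_above_cutoff(prepared, cutoff_bin=3)
```

Because both steps act only on the data passed to and returned from the process, they can be applied without altering the software that implements INTEL.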

3.4  Investigation  of  Phase  Effects 

One  of  the  most  widely  held  beliefs  in  speech  research 
is  that  the  ear  is  insensitive  to  phase.  By  this  is  meant  that 
for  sine  wave  signals  the  ear  can  detect  changes  in  amplitude 
and  frequency  but  not  changes  in  phase.  The  accuracy  of  this 
belief  is  well  established.  Not  as  well  established  but  almost 
as  widely  held  is  the  belief  that  the  ear  is  insensitive  to 
changes  in  the  relative  phases  in  the  sinusoidal  components  of 
a complex  signal  such  as  speech.  As  a simple  proof  of  the 
general  truth  of  this  belief,  consider  the  apparent  invariance 
in  voice  quality  as  a listener,  who  receives  only  reverberant 
speech  sounds,  moves  about  a live  room. 

There  is  one  condition  for  which  the  ear  is  sensitive 
to  phase,  and  that  is  when  the  relative  phases  of  the  components 
are  changing  rapidly.  One  instance  of  such  a signal  is  noise. 
In band-limited white noise the phase angle of each spectral
component  varies  in  a random  manner  with  uniform  probability 
over  the  range  0 to  360  degrees.  If  the  random  variation  in 
phase  is  removed,  then  the  characteristic  quality  of  the  noise 
sound  will  be  transformed  from  a hiss  to  a noisy  buzz.  But 
a complex  spectrum  analysis  of  the  noise  will  still  exhibit  a 
uniform  distribution  of  components  whose  amplitudes  vary  in  a 
Rayleigh  manner. 
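
These statistics can be checked numerically with a minimal sketch, assuming each spectral component of Gaussian noise is modeled as a complex number whose real and imaginary parts are independent, zero-mean, and of unit variance. The function name `noise_components`, the sample count, and the seed are hypothetical.

```python
import cmath
import math
import random


def noise_components(count, seed=0):
    """Draw complex spectral components with Gaussian real and imaginary parts."""
    rng = random.Random(seed)
    return [complex(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
            for _ in range(count)]


components = noise_components(20000)

# Amplitudes should follow a Rayleigh distribution, whose mean is
# sqrt(pi / 2) for unit-variance parts.
mean_amplitude = sum(abs(c) for c in components) / len(components)

# Phase angles should be spread uniformly over 0 to 360 degrees,
# giving a mean near 180 degrees.
phases_deg = [math.degrees(cmath.phase(c)) % 360.0 for c in components]
mean_phase_deg = sum(phases_deg) / len(phases_deg)
```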

Of  greater  relevance  to  our  study  is  the  fact  that  in 
the  case  of  noisy  speech  the  additive  noise  tends  to  randomize 
the  phase  angles  of  the  speech  components.  The  degree  of 
randomization  depends,  of  course,  on  the  relative  amplitudes  of 
the  noise  and  the  speech  component  of  interest.  At  the  outset 
of  the  study  we  hypothesized  that  such  a distortion  of  phase 
would  cause  a correlated  distortion  in  speech  quality  that 
would  contribute  to  the  loss  in  speech  intelligibility  at  low 
signal-to-noise ratios.

To test the hypothesis, we performed several experiments.
In the first one we randomized the phase angles of the
components  of  noise-free  speech  by  substituting  for  them  the 
phase  angles  of  corresponding  components  of  noise.  The  result 
was a distinct deterioration in the quality of the speech
sounds  which,  although  they  were  still  fully  intelligible, 
became  harsh  and  unnatural. 

We  next  performed  an  experiment  that  was  the  inverse 
of  the  first  one.  Noise  was  added  to  speech  at  a level  equal 
to  that  of  the  speech.  Then  the  phase  angles  of  components  in 
the  original  noise-free  speech  were  substituted  for  those  in 
the  noisy  speech.  Comparison  of  the  two  signals  showed  clearly 
that the noisy speech with non-randomized phase angles was the
more  intelligible  one.  Subjectively  it  appeared  to  be  louder 
against  the  background  of  added  noise. 

Finally,  we  used  the  procedure  of  the  second  experiment 
to restore non-random phase to the components of speech
regenerated after INTEL processing of an input noisy speech
signal. Here we substituted the phase angles of the noise-free
speech for those in the complex spectrum of the processed
signal before regeneration of the speech time waveform. When
compared with the normal output of the INTEL process this
signal was shown to be subjectively louder and more natural, with
a corresponding  increase  in  speech  intelligibility. 
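
The three experiments share one operation: the amplitude of each component is taken from one complex spectrum and the phase angle from another. A minimal sketch of that operation follows, assuming both spectra are lists of complex values on the same frequency grid. The function name `substitute_phase` and the example values are hypothetical.

```python
import cmath


def substitute_phase(amplitude_source, phase_source):
    """Combine the amplitudes of one spectrum with the phases of another."""
    return [cmath.rect(abs(a), cmath.phase(p))
            for a, p in zip(amplitude_source, phase_source)]


# Assumed two-component example spectra.
clean = [complex(3, 0), complex(0, 2)]    # phases 0 and 90 degrees
noisy = [complex(0, 5), complex(-1, 0)]   # amplitudes 5 and 1

# Second experiment: keep the noisy amplitudes, restore the clean phases.
restored = substitute_phase(noisy, clean)
```

Swapping the two arguments gives the first experiment, in which noise-free amplitudes receive the phase angles of noise.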

None of the foregoing should be interpreted to imply
that phase conveys information in speech signals. That it
does not was demonstrated in an experiment in which the phases
of  speech  components  were  imposed  on  those  of  noise  in  a signal 
that  contained  only  noise.  Not  only  was  this  signal  not  even 
remotely  speech-like,  it  was  indistinguishable  in  quality  from 
the  original  signal. 

What  is  apparent  from  these  results  is  that  rapid, 
random  variations  in  phase  degrade  the  naturalness  of  speech 
sounds. For speech at high S/N this degradation has no
significant effect on speech intelligibility, any more than does
the lack of naturalness in speech that has passed through
non-linear phase networks or speech generated by vocoders. However,
at  S/N  of  about  0 dB,  or  where  word  intelligibility  is  about  30 
percent,  randomization  of  speech  phase  clearly  contributes  to  a 
reduction  in  intelligibility  beyond  that  caused  by  the  presence 
of  noise  components. 


CONCLUSIONS  AND  RECOMMENDATIONS 

The  theoretical  analysis  has  provided  the  basis  for  an 
understanding  of  INTEL  without,  however,  providing  a basis  for 
specific quantitative estimates. Nevertheless, it has suggested
areas for future research that may be more fruitful than the
present experiments.

The  experiments  which  were  performed  on  this  project 
were  mostly  useful  in  a negative  sense.  That  is,  we  now  know 
several modifications that don't do any good. These were
experiments that had to be made sooner or later, however, so even
though  unsuccessful,  they  do  not  represent  wasted  effort. 

The  most  promising  area  for  further  research  now 
appears  to  be  refinements  in  the  gating  of  the  low-quefrency 
region  of  the  pseudo  cepstrum.  In  particular, 

1. The DC term itself should be attacked. We know it
is risky to remove this component; we do not know
whether it can be attenuated.

2. Since we know the noise contribution is a (sin v)/v
function, we ought to be able to take advantage
of this fact. If we cannot subtract (sin v)/v
directly, because of non-linear effects in the root
compression process, we ought to investigate
analogous processes, such as subtracting some
simple function of (sin v)/v.

Second,  methods  of  reducing  phase  noise  should  also  be 
investigated.  These  include  the  following: 

1.  Use  of  the  complex  spectrum  and  complex  pseudo- 
cepstrum  instead  of  the  amplitude-only  versions 
of  these  functions.  The  complex  functions  retain 
the  phase  data  of  the  input  signal.  It  is  possible 
that  by  low-quefrency  filtering  of  the  complex 
spectrum  the  phase  data  can  be  "enhanced"  in  the 
same  way  as  the  amplitude  data  is  by  the  current 
form  of  INTEL. 

2.  Averaging  of  the  phase  angles  within  each  spectrum 
peak.  For  noise-free  speech,  the  phase  angles  in 
the  central  region  of  a harmonic  peak  vary  linearly 
with frequency, with the amount of variation
proportional to the pitch glide and the order of the
harmonic.  The  average  phase  angle  within  a peak 
will  be  the  same  as  the  angle  at  the  center  of  the 
peak.  By  averaging  the  phase  angles  across  a peak 
in  the  spectrum  of  noisy  speech  we  should  be  able 
to  improve  the  estimate  of  the  phase  angle  at  the 
peak center.

3. Substitution of an artificial phase function for
the  randomized  phase  function  of  the  input  signal. 
Such  a function  would  have  to  be  compatible  with 
the  pitch  and  rate  of  change  of  pitch  of  the  input 
signal.  Otherwise,  the  rate  of  change  of  phase  at 
a pitch  harmonic  would  not  be  compatible  with  the 
frequency  of  the  harmonic  in  the  speech  spectrum. 
Obviously,  to  make  these  functions  compatible,  it 
will  be  necessary  to  measure  the  pitch  of  the  input 
speech  signal.  However,  it  may  be  possible  to 
tolerate  some  incompatibility  at  the  high  order, 
weaker  amplitude  pitch  harmonics.  Consequently,  it 
may  not  be  necessary  to  measure  pitch  with  an 
accuracy  greater  than  that  achievable  by  existing 
techniques.
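
The second method listed above, averaging the phase angles within a spectrum peak, can be sketched as follows. The sketch uses a circular mean (the angle of the summed unit vectors) rather than a plain arithmetic mean of the angles; this is a substituted technique, chosen to avoid errors when angles wrap around 360 degrees, and is not described in the report. The function name `mean_phase` and the example values are hypothetical.

```python
import cmath


def mean_phase(components):
    """Circular mean of the phase angles of complex spectral components."""
    # Normalize each component to unit length so that every bin in the
    # peak contributes equally, then take the angle of the sum.
    unit_sum = sum(c / abs(c) for c in components if abs(c) > 0)
    return cmath.phase(unit_sum)


# Assumed noise-free peak: phase varies linearly across five bins and
# equals 0.5 radian at the center bin.
peak = [cmath.rect(1.0, 0.5 + step) for step in (-0.2, -0.1, 0.0, 0.1, 0.2)]
estimate = mean_phase(peak)
```

For the symmetric, linearly varying phase assumed here, the estimate equals the phase angle at the peak center, as the text states for noise-free speech.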

Figure 1. Block Diagram of INTEL

Simplified Model of INTEL

Figure 3. Probability Density Function
of Root-Compressed Noise for Various
Values of the Root-Compression Factor n

Figure 5. Probability Density Function
of Logarithmically-Compressed Noise for
Various Values of the Log-Compression Factor k